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Abstract 

An unsupervised and a supervised classification approaches for Hilbert random curves are 
studied. Both rest on the use of a surrogate of the probability density which is defined, 
in a distribution-free mixture context, from an asymptotic factorization of the small-ball 
probability. That surrogate density is estimated by a kernel approach from the principal 
components of the data. The focus is on the illustration of the classification algorithms and 
the computational implications, with particular attention to the tuning of the parameters 
involved. Some asymptotic results are sketched. Applications on simulated and real datasets 
show how the proposed methods work. 

Keywords: density based clustering; discriminant Bayes rule; Hilbert data; small-ball 
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Introduction 

In multivariate classification problems, whether they are supervised or unsupervised, joint 
density, or better its estimate, plays a central role. In order to make it clear, one has just 
to think about the literature on the model based clustering approaches, and recall that all 
the discriminant methods resting on the so-called Bayes rule require an estimation of the 
joint density in each group (for recent developments, see for instance Bock et ah, 2013, 2014, 
Gimelfarb et al., 2012). 

When one deals with data belonging to functional spaces (for a general introduction 
on this topic, one can refer to the monographs of Ferraty and Vieu, 2006, Horvath and 
Kokoszka, 2012 and Ramsay and Silverman, 2005, and to the recent book Bongiorno et ah, 
2014), the dimensionality problem arises immediately, and, as a consequence, a probability 
density function generally does not exist (see Delaigle and Hall, 2010). Hence, a direct 
extension of density oriented classical multivariate classification approaches to functional 
data cannot be implemented: usually, a reduction of dimensionality based on projection over 
suitable finite subspaces, is made as a preliminary step to tackle the problem. This route 
is followed for instance by James and Sugar (2003), where model based clustering methods 
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are proposed, and by James and Hastie (2001) and Shin (2008) in defining a discriminant 
approach: the general idea is to put a suitable density mixture model over the coefficients 
of the representation of functional data in the finite subspace, admitting that such model 
may summarize the distribution of the underlying process. It is worth noting that also other 
dimensionality reduction approaches are possible: for instance, one can recall the techniques, 
which rest on the most important points, illustrated in Delaigle et al. (2012). 

Another way to proceed, that aims at working directly on the distribution of the process, 
refers to the concept of surrogate (or pseudo) density. The general principle, dating back to 
Gasser et al. (1998), is to factorize the small ball probability associated with the functional 
data, when the radius of the ball tends to zero, as a product of two terms: an “intensity 
term” which depends only on the center of the ball, and a kind of “volume parameter” 
which depends only on the radius. Since the first term reflects the latent structure of the 
distribution of the underlying process, it represents an ideal candidate to play the role that 
the multivariate density has in finite dimensional classification methods. Theoretical condi¬ 
tions that allow such a factorization when one considers Hilbert functional data in the space 
determined by the basis of the Karhunen-Loeve decomposition (i.e. by the eigenfunctions of 
the principal components analysis), and the structure of the pseudo-density (linked to the 
principal components, i.e. the coefficients of the decomposition), are discussed in Delaigle 
and Hall (2010) and then in Bongiorno and Goia (2015), where some assumptions are re¬ 
laxed. One can observe that this approach allows to see the projective approaches in a more 
general theoretical context. 

To put into effect the above factorization and take advantage of the pseudo-density, 
different ways are possible. A first one is to specify a suitable density model mixture for the 
principal components: this full parametric approach is followed by Jacques and Preda (2014) 
in defining a Gaussian mixture clustering procedure. On the other hand, a full nonparametric 
approach is possible as done in Ferraty et al. (2012) where a k-NN procedure is proposed to 
estimate the pseudo-density. 

In this paper we consider an intermediate approach for evaluating the pseudo-density: 
after computing the first d principal components of the functional data, we obtain an estimate 
of their joint density fd via the classical Parzen-Rosenblatt kernel method. We can see this 
approach as semi-parametric: if on one hand, we use coefficients of the Karhunen-Loeve 
decomposition in defining the pseudo-density (it is not estimated directly), on the other 
hand, the mixture model is not specified. 

The goal of this paper is to clarify how to use the proposed method to tackle classification 
problems for Hilbert data: we illustrate both a pseudo-density oriented clustering algorithm 
resting on the definition of clusters as a high intensity regions from the modes of fd, and a 
classifier based on the Bayes rule in the discriminant context, under suitable hypothesis on 
the distributions of the mixture components. After introducing the general mixture model 
and the theoretical motivations, the algorithms are illustrated in details, focusing on com¬ 
putational aspects and on the tuning of different parameters involved (as, for instance, the 
number of considered principal components, the scale of the bandwidth matrix in estimating 
the joint density fd and a “mode cap” parameter whose purpose is to prevent the springing 
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of too much spurious modes). The study is completed with an analysis of real and syn¬ 
thetic datasets: a special attention is paid, in both clustering and discriminant framework, 
to mixtures presenting non-spherical clusters. 

The paper is organised as follows: in Section 1 we introduce the mixture model and its 
factorization; Section 2 is devoted to illustrating the classification algorithms (clustering and 
discriminant) and to discussing the parameters used; Section 3 collects the applications to 
simulated data and real cases; Section 4 gathers some final comments; finally, in Appendix A 
we sketch the proofs of the main theoretical results. 

1. Theoretical framework 

Let X be a random element defined on the probability space (fl, J 7 , P), taking values in 
the Hilbert space C^ Q endowed with the usual inner product (•, •) and associated norm || • ||. 
Denote by 

fix = {E [x (t)] ,fe(0,l]}, and E [•] = E [(X - fix, •> {X - fix)} 

the mean function and covariance operator of X respectively. A measure of concentration 
of A" is given by the small-ball probability, briefly SrnBP (see Ferraty and Vieu, 2006 and 
reference therein), defined as 

(p (x, e) — P (||X — x|| < e ), x G £f 0il ], £ > 0. 

Suppose that D is partitioned in G (unknown and finite) sub-sets f2 ff , and let Y be a N-valued 
random variable defined by 

G G 

Y M = ^{Y = g) = n g >0, ^7T g = l, 

9=1 9=1 

(here IU denotes the indicator of A) and consider the conditioned SrnBP 
<p(x,e\g) =F(\\X -x\\ <£ \ Y = g), g = l,...,G, 
that leads to the mixture 

G 

= ^ j 'K g y(x 1 £\g ), x G T[ 01] , e > 0. (1) 

9=1 

The latter expression is the starting point for approaching model-based classification prob¬ 
lems: when Y is a latent variable, we deal with an unsupervised classification problem 
focused on the left-hand side of (1), see Section 2.1; whereas, when Y is observed, the model 
leads to the construction of a Bayesian classifier whose starting point is the right-hand side 
of (1), see Section 2.2. In both cases, instead of tackling it directly, we want to simplify it by 
exploiting an approximation result sketched below (for more details see Bongiorno and Goia, 
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2015). For the sake of simplicity, it is presented with respect to the process X; however, the 
same arguments can be applied to (X\Y = g ) with g — 1,..., G, provided a suitable change 
of notation is made. 

Consider the Karhunen-Loeve expansion of X: denoting by {Aj, the decreasing to 

zero sequence of non-negative eigenvalues and their associated orthonormal eigenfunctions 
of S, it holds 

OO 

X it) = fix (t) + XI (*) > 0 < t < i, 

3=1 

where 9j = (X — px,£j) are the so-called principal components (PCs) of X satisfying 
E [Oj] = 0, Var (0,-) = A,-, E [0&.] = 0, j ± f. 


Without loss of generality, from now on suppose that Hx = 0. Moreover, assume: 

(A.l) the hrst d PCs 6 = { 6 1 ,..., 6d)' admit a strictly positive and sufficiently smooth joint 
probability density f d ; 

(A.2) x is an element of Aj), such that sup{ar-/Aj : j > 1} < oo, with Xj = (x,^); 

(A.3) there exists a positive constant C (not depending on d) for which 


V AiA j 

sup sup 


d 2 f d W 


d'diddj 


<C, 


for any id E D, 


where D = j$ e R d : Y2j<d (^3 ~ x i) 2 — d 2 } ^ or some P > £ ] 

(A.4) the spectrum of £ is rather concentrate: {A J }?h 1 decays to zero exponentially, that is 


X d 1 A j < C, for any d G N. (2) 

j>d +1 


Proposition 1. Under (A.1)-(A.4), as £ tends to zero, it is possible to choose d = d(e) 
diverging to infinity so that: 

ip{x,e)~ f d {xi,...,x d )<t>(d,E). (3) 

From a practical point of view, and as we will see in the sequel, the exponential decay 
required by Assumption (A. 4) is sufficient to reach the scopes of this paper. Nevertheless, to 
better appreciate the interpretation of factors in (3) and, in particular, because the form of 
( t>(d , e) depends on them, we list below the forms of <f associated to the different eigenvalues 
decays. In particular: 
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• if {Aj}^ decays exponentially (2), then 

(j>(d, e) = exp | ^ d [log(27ree 2 ) - log(d) + S(d, a)] j , 

where £(-,•) is such that lim a _ ) . 00 limsup^^ 5(s, a) = 0 and a is a parameter chosen 
in such a way that A^ 1 ^ 2 < ct 2 ; 

• if {A-, }^ decays super-exponentially 


X d ^ } A j —t 0, as d —7* oo 
j>d +1 


( 4 ) 


or, equivalently, X d+ i/X d —>■ 0 (as d —> cx)), then 


4>(d, e) = exp -d [log(2vree 2 ) - log(d) + 5(d)] }> , 


where 5(d) = o(l) as d —» oo; 
if {Aj}}^ decays hyper-exponentially 


*(s>) N) 

\j>d+l / \j<d J / 


— = o (1), as d —y oo 


( 5 ) 


then 


<t>(d,e) 


£ d TT d/2 

V (d/2 + 1)' 


The fact that in the hyper-exponential case, <f>(d, e) is the volume of a d-dimensional 
ball with radius s, justihes to interpret, in Equation (3), 4>(d,e) as a d-dimensional volume 
parameter, whilst f d , being the only factor depending on x G £ 2 0] j, as a surrogate of the 
density of the Hilbert process. 


Remark 2. Note that (5) => (4) =>■ (2) , whereas the vice versa does not hold. For instance, 
for any a > 1 and /3 > 0, X d = exp {—/3d} decays exponentially but not super-exponentially, 
X d = exp {—/3 d hi (In (d))} decays super-exponentially but not hyper-exponentially while X d = 
exp {—/3d"} decays hyper-exponentially. 


Remark 3. Although the results are exposed using the Karhunen-Loeve (or PC A) basis, they 
hold for any orthonormal basis ordered according to the decreasing values of the variances of 
the projections, provided that they decay sufficiently fast (see Bongiorno and Goia, 2015). 
Note that the variances obtained when one uses the PCA basis present, by construction, the 
faster decay: in this sense the choice of this basis can be considered optimal. 
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2. Classification 

This section is devoted to defining classification procedures in a functional framework. 
The idea is to exploit the asymptotic factorization results provided by Proposition 1. It 
is divided in two subsections: the first one illustrates a clustering algorithm, whereas the 
second one deals with a discriminant analysis procedure. 

2.1. Unsupervised classification 

Consider a sample {(Ad, Yf ),..., (X n , Y n )} drawn from (X, Y) defined as in Section 1, 
where X’s are observed while the group variables Y’s are latent. Our aim is to determine the 
range of Y (i.e. G) and, for each observed X t the membership group (that is the value of Yf). 
If the distribution of {X\Y = g) is specified then a full parametric approach applies; this has 
been done, for example, in Jacques and Preda (2014) where the authors used a maximum 
likelihood and expectation maximization approach to identify the distribution parameters 
of a Gaussian mixture assumed for fd- Instead, if no information is available, a distribution 
free model could be used. In this latter view, consider the SmBP mixture (1) and apply 
Proposition 1 to its left-hand side to obtain: 

G 

s ^2'K g y{ : x,E\g) = ip{x,e) ~ / d (xi,...,a; d )0(d,e), x e £f 0) i]> £ -»■ °- 

9=1 

Such expression highlights how the surrogate density fd carries the information on the mix¬ 
ture and, at the same time, endorses a “density oriented” clustering approach on fd as a 
fruitful tool in detecting the latent structure; hence, from now on, we assume that there 
exists a positive integer d* such that fd is G-modal for any d > d*. Therefore, our scope 
is to identify the groups by a “locally high (surrogate) density regions” principle: the clus¬ 
tering algorithm computes the estimates rhd, g of the modes m^, that is the local maxima 
of f d , whose number G estimates G; for each g, it finds the largest connected upper-surface 
containing only m^ 9 , and hence it assigns group labels to each observation consistently with 
its proximity to these sets. The algorithm procedure is illustrated below: 

Step 1. Estimate the covariance operator and the eigenelements. 

Step 2. Fixed d, compute fd, n (an estimation of the joint distribution density fd). 

Step 3. Look for the local maxima frid. g of fd iH , 9 = 1, • • •, G over a grid. 

Step 4. Finding Prototypes: for each g in {1,... ,G}, the g-th “prototypes” group is formed 
by those X J whose estimated PCs (0 1; j,..., 6 d) i) belong to the largest connected 
upper-surface of fd, n that contains only the maximum fhd, g . In other words, for 
such individual, Y % — g. 

Step 5. Assign each unlabelled X t to a group by means of a k-NN procedure (with k — 1). 
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As a by-product of such algorithm, it is possible to define a center for each cluster through 
the “d-dimensional modal curves” built from rh d , g and defined as follows: 

d, 

3 = 1 

where is the j-th term of m c i, g and £j(t) are the empirical versions of £j(t). 

In the remaining part of this section, some theoretical and practical aspects of the algorithm 
will be discussed. 


2.1.1. Surrogate density estimation 

In order to estimate f dl we consider the classical multivariate kernel density estimator: 


fd,n (iW) =-J2 K h( \\nd(X,, 


i —1 


X) 


u d x e 


where Kh (u) = det {H)~ 1/2 K K is a kernel function, H is a symmetric semi- 

definite positive d x d matrix and, hp (•) = J2j=i is th e projection operator over the 

subspace spanned by j^i,... ,^|, i.e. the hrst d eigenfunctions of E n (the sample version 
of E) so that the kernel argument depends on the estimated PCA semi-metric (Ferraty and 
Vieu, 2006, Section 8.2). It is worth noticing that in estimating f d the use of the estimated 
projector hp instead of hp introduces a non-standard source of noise that might modify the 
consistency properties and, hence, should be considered with care. In fact, from a theoretical 
point of view, one may wonder if such estimator is consistent for f d and, when this is the 
case, if it attains the same rate of convergence that holds when Ip is known. A positive 
answer was provided in Bongiorno and Goia (2015); in particular, consider the special case 
H n = . and suppose that: 

(B.l) the density f d (x) is positive and p times differentiable at x G 
(B.2) h n is such that h n —>• 0 and nh^f log n —* oo as n —>■ oo; 

(B.3) the kernel K is a Lipschitz, bounded, integrable density function with compact support 

[ 0 , 1 ]; 

(B.4) the process X is bounded. 


Then, the following result holds: 

Proposition 4. Assume (B.l)-(B.4) with p > (3d + 2) /2 and consider the optimal band¬ 
width 

c\n~ 2 p+ d <h n < c 2 n _ 2 p+ d (6) 
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where C\ and c 2 are two positive constants. Thus, as n goes to infinity, 


E 


2 


fd (x) ~ fd,n (x) 


O (n~ 2p/( - 2p+d) ) , 


uniformly in M. d . 

From a practical point of view, the bandwidth selection is an important task: our choice 
is to consider a diagonal bandwidth matrix (see Duong and Hazelton, 2005 for a heuristic 
justification) whose non-null entries are the univariate bandwidth provided by Silverman 
(1986, p.48). Anyway, it is clear that the larger is \H\, the “smoother” is fd,m the smaller 
is the number of modes and hence the number of groups. In other words, different choices 
for the bandwidth may be considered in order to catch different phenomenon scales; this is 
done by applying to H a scale factor 5 > 0, whose optimality (in a sense to be specified) is 
discussed below. 

2.1.2. Prototypes identification and Modes 

The identification of prototypes is the core of the algorithm: it rests on the largest 
connected upper-level sets of f^ n related to each mode fh^ g . The latter plays just an instru¬ 
mental part in identifying prototypes. More in details, we use the graphical visualization 
system of R software: A, : is assigned to the g -th prototype group if its estimated PCs be¬ 
long to the r/-th upper-level sets, by using the algorithm described in Liu et al. (2010) and 
available in the R-package ptinpoly. From a practical point of view, a problem concerns 
spurious modes caused by the choice of 5 and sampling variability. To tackle this issue, 
we select only those modes fhd, g for which fdifhd,g) is maximum over the parallelepiped 
rhd, g ± [Opr/?!] x ... x [0,r/ij, where hj is the resolution along the j-th direction of the 
grid used to estimate the density, while r is a positive integer, playing the counterpart of a 
tolerance coefficient. 

It is worth recalling that, in the multivariate literature, alternative techniques in detecting 
high density regions can be found (see for instance, Azzalini and Torelli, 2007, Sager, 1979, 
Rinaldo et ah, 2012 and references therein). 

To conclude, we provide a glimpse of the theoretical aspects about mode estimation. For 
a fixed d, consistency properties for estimated modes have been considered, for instance, in 
Abraham et al. (2003), Chen et al. (2015) and Sager (1979). Nevertheless, such theoretical 
issues are little studied in the infinite dimensional framework, since the mathematical concept 
of the density is still under-developed; some attempts can be found in Gasser et al. (1998) 
and, more recently, in Bongiorno and Goia (2015), Dabo-Niang et al. (2007) and Delaigle 
and Hall (2010). 

2.1.3. Tuning parameters 

Given a data set, different values of d, r and S may lead to different cluster results: thus, 
in many situations, a criterion to choose them is opportune. In literature it is acknowledged 
that, for the most clustering algorithms, there is not a selection “golden rule” of the parame¬ 
ters. In practice, a series of trials and repetitions are performed to tune the parameters (see 




Xu and Wunsch II, 2005 and references therein). In this view, automatic selection rules can 
be used only as support in decisional steps. 

In view of Proposition 1, parameter d should be large enough to guarantee a good approxima¬ 
tion for the SmBP (3), but small enough to avoid the well-known “curse of dimensionality” 
in estimating non-parametrically fd . A good compromise is to choose d so that the Fraction 
of Explained Variance, FEV(d) = ^2j< d ^j/^2j >i Ab larger than a suitable constant. In 
practice, FEV is estimated from eigenvalues of X. 

For what concerns a choice for (r,5), and to provide some insights on the quality of 
clustering solutions, we exploit both external and internal criteria although they are not 
universal and effective tests for an optimal choice. In particular, we implement: the purity 
index (based on some prespecified structure, which is the reflection of prior information on 
the data), the “Calinski and Harabasz”, or briefly “CH”, index (that does not depend on 
external information) and, when feasible, a combination of the two indices. Thus, accord¬ 
ingly to the nature of data, the idea is to look at those (r, S) which furnish the best value for 
these validation criteria and, as a consequence, suggesting the number of clusters G. In the 
remaining part of this section, we summarise these two criteria; more details can be found 
in Xu and Wunsch II (2005) and references therein. 

The purity index measures how close a clustering is to an available pre-specihed class struc¬ 
ture and, more precisely, the extent to which each cluster consists of objects from a single 
class. In particular, for each cluster, consider the class distribution of the data; i.e. for class 
j compute Pgj, the frequency a member of cluster g belongs to class j as p g j = ^ g j/^ gi where 
7 T g is the proportion of objects in cluster g, and 7 T g j is the proportion of objects of class j in 
cluster g. Hence, for each cluster g, purity is calculated as 

P g = PgiVi d) = ma x{p gj : j — 1,, L}, 

where L is the number of pre-specihed classes, whilst the total purity is the sum of the 
cluster purities weighted by the size of each cluster p = ^ g=1 Pg^g- Clearly, p ranges in [0,1] 
with p — 0 meaning maximum separation and p = 1 maximum cohesion. 

Among the internal validation indices, the CH index is well-known and often achieves the 
best performance (see Dubes, 1993). It is defined as 

f Tr(S B ) / Tr(S w ) K , 

CH = CH(8,r) = < K ~ 1 / n ~ K ’ ’ 

l 0, K = 1. 

where N is the sample size, K is the number of clusters obtained by choosing the couple 
(<5, r), and Tt(Sb) and Tr(Sw) are the traces of the estimated between and within covariance 
matrices, respectively. The couples (5,r) that maximize CH are selected as optimal, and 
the number of clusters K is consequently obtained. 

It is worth noticing that purity is not affected by the geometry of the point clouds, whereas 
CH is. In particular, since CH definition is based on variances, it is expected that CH 
provides good results whenever the clusters are elliptical point clouds and not otherwise. 
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2.2. Supervised classification 

In discriminant analysis, differently from clustering, the presence of G distinct groups is 
established and modelled by the observed variable Y: the aim is to label each new incoming 
observation according to this known group structure. To do this, a typical approach is to use a 
Bayes classification rule: given an observation x, one assigns it to the class y(x) G {1,..., G} 
to which corresponds the highest a posteriori probability P(Y = 7 ( 0 ;) | X = x): 


y(x) = arg max P(F = g\X = x ). 
g=l,...,G 


Equivalently, y(x) is the index g' in {1,..., G } such that 
P(y = g'\X = x) ^ 

—— -—-- > 1, lor any q = 1,..., G and q q . (7) 

P(Y = g\X = x) ’ J y ’ ’ y r y v ; 

If a probability density of A" in the g- th group f(x\g) were known (with f(x\g) > 0), thanks 
to the Bayes formula, Equation (7) would simplify as follows: 


v/W) 

^gf(x\g) 


for any g ± g', 


and, consequently, the classification rule would become: 


q(x) = arg max n g f{x\g). 

9 = 1.Cr 

It is clear that such arguments do not apply straightforwardly in functional settings without 
further assumptions on the probability measures. A possible way to tackle the problem is to 
consider the following classification rule: assign a new functional observation x to the r/-th 
group for which, as £ tends to 0 , 


P(Y = g | 

1 

H 

A 

O) 

P(Y = g' | 

1 

A 


for any g ^ g. 


( 8 ) 


At a glance, it is evident that it is hard to use in practice. Anyway, thanks to the Bayes 
formula, the ratio in ( 8 ) becomes 

ngip (x,e| g) 

7 t 9 '^( x ,£| g'Y 

Whenever assumptions of Proposition 1 hold for each (X\Y = g) with g = 1 ,... ,G, then 
the above ratio reduces to 


TT g fd g (x\g)<j)(dg,£) 

v /, s/ (x| as£ ’ 

where fa (x| g) is the joint density of the first d g PCs computed using the (conditional) 
covariance operator Y g of the group g. Clearly, the asymptotic behaviour of the ratio is 
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related to the trade-off between the volume parameters 0 (-,e) and the probability densities 
evaluated at possibly different dimensions d g and d g >. 

The classification rule may be further simplified if additional assumptions are imposed 
on the mixture process A", since the spectrum decay of £ controls the one of each £ 9 (the 
conditional covariance operator corresponding to the g-th group), the starting point to 
build fd . In particular, consider the variance decomposition £ = B + W, where B and 
W = G 7 t 9 £ s represent the between and within covariance operator respectively. A 
straight application of the Courant-Fischer-Weyl min-max principle for linear operators 
leads to 


X k (M 1 )<Xk(M 1 + M 2 ), 


where {Aj(M,)} denote the eigenvalues of the linear operator M* (in decreasing order). The 
latter inequality ensures that the eigenvalues of B,W and £ ff (g — 1,..., G) have a decay fast 
at least as the one of £. In other words, since the eigenvalues’ decay rate is a measure of how 
much X is concentrated in the space, the process (X\Y = g ) in each sub-population must 
be “concentrated” at least as much as X. As a consequence, if the spectrum of £ decays 
exponentially (according to (2)), then d g can be chosen equal to d for any g = 1 ,... ,G, 
(f>(d, e) simplifies, and one can write the classification rule ( 8 ) similarly to the multivariate 
case, by replacing a probability density with a surrogate version: assign a new functional 
observation x to the g-th group for which, as e tends to 0 , 


Kgfd (x\g) > 1 

n g 'fd{x\g') 

or, equivalently, as d tends to +oo, 


for any g' e G}, g' ± g, 


l{x,d) = arg max n g f d (x\g). (9) 

g=l,...,G 

Operatively, if one could specify the conditional densities fd(x\g), a full parametric approach 
would be possible. Although ^(x, d) still depends on e by means of d (see Proposition 1), it 
is not so restrictive to assume that 


(A.5) 7 (x, d) is constant as d goes to infinity; i.e. there exists a positive integer d* such that 
y(x, d) = j(x, d*) for any d > d*. 


This assumption holds, at least, in the case of a finite dimensional process. 

At this point, a comparison of our approach with the one introduced in James and Hastie 
(2001) is interesting, and one can trace some parallelism. Indeed, in both approaches, a 
dimensionality reduction step based on projection onto a finite vector subspace generated by a 
previously chosen basis is implemented, and so the classification rule involves the conditional 
joint densities of the projection coefficients. Moreover, if d g = d for all g , and one assumes an 
underlying Gaussian mixture model, both approaches lead to the same classifier: the present 
section theoretically justifies the use of (9) in a finite dimensional subspace. However, if 
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the eigenvalues of £ decay slowly, we cannot ensure that d g is the same varying g, and 
the volume terms (j>(d g ,s) cannot be neglected in the classification rule. Consequently, the 
approach based on a pure projective method could be unfruitful. 

The illustrated method can be framed by the full parametric discrimination described 
in James and Hastie (2001) and the full nonparametric one proposed by Ferraty and Vieu 
(2003), where the a posteriori probability is estimated directly by a kernel regression ap¬ 
proach. 


2.2.1. Estimate classifier 

Once d is chosen, since we want to work in a distribution-free context, densities fd(x\g) 
have to be estimated. Consider a sample {(JO,!{),..., (X n , Y n )} from (X, Y), with d g = d 
for each g — 1,..., G; a kernel density based estimator for (9) is given by: 


^ n 

7 n (x, d) = arg max — ^I { y i=g} K Hg f U g>d (X, - x ) J 


9 2—1 


( 10 ) 


where n g = 5 ^”=] ^{F= 3 } i s th e number of observations coming from group g, 7r g = n g /n 
estimates the mixture coefficient 7 r g , K Hg (u) = det K ^H g l ^ 2 u^, with K a kernel 

function, the bandwidth matrix H g is d x d and symmetric semi-definite positive, and finally 
II 9td is the projection operator over the subspace spanned by the first d eigenfunctions of the 
sample covariance operator £ g for the group g. 

Note that some simplifications may occur in (10). For instance, if the groups are balanced, 
that is n g = 1/G for each g, then one can drop n g and n g . Another example concerns the 
homoscedasticity case (i.e. £ g is the same for each g), where U g)d = 13^ is the projector over 
the space spanned by the first d eigenfunctions of the within (or pooled) covariance operator 

w = Y. c ,LA%. 

For what concerns the choice of d and H g , one can refer to the discussion in Section 2.1.3. 
It is scarcely necessary to observe that, in the discriminant context, coefficients r and 5 are 
not necessary, because we have a specific bandwidth matrix for each group and the main 
goal is not the estimation of a mode. 

To complete the analysis, we study the asymptotic properties of the classifier defined 
in (10). In particular, we consider the Bayes probability of error 


L* = min P (7 (X) ^ Y) 

7 

and the conditional probability of error 


L n = P (7n (x, d) ^ Y I {(X 1; Y x ) ,..., (X n , Y n )}) 


and we study how L n behaves when n tends to infinity. In particular, convergence of L n 
to the Bayes error probability L * is stated in the following proposition, which is a direct 
consequence of the results by Devroye (1981), Section 5. 


Proposition 5. Take H g = h g I. Under assumptions (A.1)-(A.5), L n converges to L * in 
probability, as n tends to +00. 
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3. Applications to synthetic and real data 

This section concerns the simulations and applications of the previously described meth¬ 
ods: the first two subsections are dedicated to clustering (3.1 considers an experiment under 
a controlled set-up, whereas in 3.2 we apply clustering to a real dataset), and the last one 
(Section 3.3) is dedicated to discriminant analysis. 


3.1. Clustering: simulation examples and comparison with competitors 

In the following, a simulation study provides a quantitative comparison of the presented 
algorithm versus competitors. Although the methods are unsupervised, we evaluate their 
ability in detecting the underlying group structure by measuring a misclassification error, 
as if it was a supervised exercise. Besides, by construction the SmBP clustering provides 
an estimate of the number of clusters that must be studied as well, it being a source of 
noise. As pointed out in Section 2.1.3, both misclassification error and number of detected 
clusters depend on the choice of parameters: keeping this in mind, the simulation exercise 
is coherently calibrate. 

In order to generate the dataset, we use the functional basis expansion: 

L 

X- 9 \t) = Y Vfii t e [ 0 , 1 ], i = 1,..., n g and g = 1 ,..., G, 

1=0 


where = 0.7 x 3 1 (Z = 1,..., L = 150) and is the Z-th element of the Fourier basis 




\/2sin(27rmt — 7r), l = 2 m — 1; 
y/2cos(2Trmt — 7r), l — 2m. 


The mixture is controlled by means of r^’s. Here, we deal with G — 2 and, to avoid 
spherical shaped groups, uncorrelated but dependent coefficients are generated as 

follow: 

' r- 9 i = sin(^)cos(p {s=2} ) + ae iA 
r ,-2 = sin(tfj) sin(fI {g=2} ) + ae i}2 
T-f = cos (tfi) + ( —k ) 9 + ae ii3 
k rjj = VoTle^i, 4 < Z < L 

with ($j) i.i.d. as a Beta(5,5) scaled on [— 7 r, 7 r] and i A/"(0,1). In other words, 

( T if)i=i are Cartesian coordinates of the spherical ones (1, 0\ 9 \ |I{ s = 2 })f=i plus a vertical 
translation (—k) 9 and a Gaussian noise e (randomness is confined in the polar angle i? and 
in the noise e). In particular, limited to the first three components, we deal with two 
noised semi-circumferences laying on orthogonal planes, with unitary radii, whose centers 
are ( 0 , 0 , ±k) and chosen so that the clouds of points of (r-f)f =l look like two interlocked 
horseshoes. A reasonable range for k is (0,1): outside this range, the two un-noised groups 
can be separated by means of a plane, a structure easily identifiable. Concerning cr, one can 
choose (0,fc/3) to avoid that groups overlap too much due to noise variability. 
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With such choices, we are concentrating the process along three orthogonal directions so that 
the PCs tend to replicate the r’s structure. Moreover, this setting ensures that Proposition 1 
applies. In fact, the eigenvalues (of S) decay faster or at least equally to {PiVar(ri)}f =1 . Due 
to boundedness of Var(ri), it inherits the same decay type of {Alfki that is exponentially 
(2) with C = 1/3. 

In our simulations, we consider rq = n 2 = 300, a = a/0.005 and k = 0.5. This setting 
leads to have FEV(3 ) always greater than 99%, that suggests us to fix d = 3. Curves are 
generated over a grid of 100 equispaced points on [0,1]. For the sake of illustration, Figure 1 
depicts the scatter plot of an observed set of (rij)f =1 , a selection of the corresponding curves 
and the prototype regions obtained with our algorithm when 8 = 2 and r = 5 (that produced 
G = 2). 



Figure 1: Left to right: simulated coefficients a sample of simulated curves and upper level sets 

associated to the estimated modes in the factor space. 

We generate 400 Monte Carlo samples according to the above setting. To each replication, 
we apply the SmBP clustering that returns the corresponding estimated number of clusters G 
and the misclassihcation error. According to FEV criterion we set d = 3, and we explore the 
behaviour of the algorithm when 8 = 0.6,1,1.4 and r = 1, 5,10. The following competitors 
are considered: 

(KM) the functional /c-means clustering (see Febrero-Bande and Oviedo de la Fuente, 2012) 
with G = 2; 

(GM) the EM clustering method based on a mixture model of Gaussian components applied 
to the first three PCs. We consider both G = 2 and G estimated by means of a 
Bayesian Information Criterion (BIC). The algorithm is coded in package Rmixmod 
(see Biernacki et ah, 2006). 

Table 1 collects the main results. For SmBP clustering, we report the mean and the standard 
deviation of misclassihcation errors and the 0.5, 0.75, 0.9 order quantiles of G varying 8 and 
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r; the same results are provided for (GM) combined with BIC, and only the misclassihcation 
error whenever G is fixed. Such results show that there exists an optimal configuration of 
the parameters 6 = 1.4 and r = 10 for the SmBP clustering for which the two clusters are 
correctly recognized at least in 90% of the cases, with an average misclassihcation error equal 
to 8.8%. It can be noted that the parametric method (GM) produces good results whenever 
G is fixed, but gets worse when G has to be estimated, since BIC overestimates the number 
of clusters. 


Algorithm 

Parameters 

Miscl. 

Error 


G 



5 r 

Mean 

St. Dev. 

< 70.5 

< 70.75 

< 70.9 


0.6 1 

0.676 

0.060 

11 

13 

14 


5 

0.480 

0.112 

6 

7 

8 


10 

0.126 

0.140 

2 

3 

3 


1 1 

0.563 

0.088 

7 

8 

9 

SmBP 

5 

0.372 

0.141 

4 

5 

6 


10 

0.081 

0.144 

2 

2 

3 


1.4 1 

0.463 

0.107 

5 

6 

7 


5 

0.299 

0.146 

4 

4 

5 


10 

0.088 

0.174 

2 

2 

2 

# clusters G 

KM 

2 

0.377 

0.068 

— 

— 

— 

GM 

2 

0.153 

0.106 

— 

— 

— 


BIC selection 

0.666 

0.034 

9 

10 

11 


Table 1: Misclassification errors of SmBP clustering versus competitors and (when available) quantiles of 
the estimated number of clusters. 


3.2. Clustering: real data illustration 

We illustrate how our clustering technique (from now on, SmBP clustering) works when 
applied to a real dataset. The aim is twofold: on one hand, it shows the cognitive support 
that the method could bring to the studied phenomenon and, on the other hand, what kind 
of practical problems could occur and how to treat them. 

The presentation goes through three datasets belonging to different domains: spectro- 
metric analysis, energy consumption and neuroscience. 

3.2.1. Spectrometric curves 

Spectroscopic analysis is a fast, non-destructive and inexpensive technique which provides 
an estimate of the composition of an aliment based of the absorption of light emitted with 
different wavelengths by a spectrometer. Since the measure of absorption is a function of the 
wavelength, it represents a typical functional data. In the last two decades, various functional 
techniques have been widely explored for this kind of data: see for instance, Ferraty and Vieu 
(2006), Delaigle et al. (2012) in the supervised classification framework. 
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Wavelength 


Wavelength 


Figure 2: Tecator curves (left) and their 2-nd derivatives (right). 


In the following, we illustrate an application of the SrnBP clustering method to the 
well-known Tecator dataset (available at http://lib.stat.cmu.edu/datasets/tecator). 
It consists of 215 spectra in the near infra-red (NIR) wavelength range from 852 to 1,050 
nm, discretized on a grid of 100 equispaced points, corresponding to the same number of 
finely chopped pork samples. Fat, protein, and water content, obtained by a traditional 
chemical analysis, is available for each sample. As conventionally done, in our study we 
consider the second derivatives of spectrometric curves instead of the original ones, to avoid 
the well-known “calibration problem” due to the presence of shifts in the curves (see Ferraty 
and Vieu, 2006). Original spectrometric data and their second derivatives are visualized in 
Figure 2. 

Since spectrometric data should represent a way to determine the chemical composition 
of the meat, the structure of the distribution of chemical components should be reproduced 
by the one of spectrometric curves. In this view, we first study the available chemical 
measures. The correlation analysis shows that the three components are highly linearly 
correlated: in particular, fat and water present a linear correlation coefficient equal to —0.988, 
whereas the content of protein exhibits a positive correlation with fat (0.814) and a negative 
one with water (—0.861). This suggests using PCA in order to summarize the chemical 
composition: in that way, the first PC explains the 98.5% of the total variability. Observing 
the kernel estimate of the density of this PC (see the upper panel in Figure 3), the poly- 
modal distribution suggests that the sample is a mixture of three kinds of meats: the three 
groups are detected by considering the largest upper level sets containing the modes of the 
estimated density that, in the one dimensional case, reduces to look for the local minima 
whose abscissa identify class boundaries. Hence, it is expected that the spectrometric curve 
distribution presents a three modal structure as well as, that should be detected by the 
clustering algorithm defined in Section 2.1. 

After running a functional PCA on the second derivatives of spectrometric curves, we 
find that the spectrum is rather concentrated: the first three PCs explain 98.5% of the total 
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Figure 3: Density estimate of the first PC of chemical components and the same stratified according to 
clusters analysis carried out with r = 5 and <5 = 0.2. 


variability, and this suggests to use d — 3 in our approximation. The selection of parameters 
r and 8 is performed according to the maximization of CH index over a bivariate grid built 
with r = 1,..., 7 and 8 varying from 0.2 to 1 with step 0.1. Index CH reaches its maximum 
for r = 5 and 5 = 0.2, to which it corresponds k = 4 clusters. In order to understand the 
appropriateness of this choice, besides the internal criterion CH we use an external criterion 
by computing, for each couple of parameters r, 5 , the index of purity according to the three- 
group structure shown by the first PC which summarizes the chemical variables. It emerges 
that the couple (r, 8) maximizing CH provides also a high degree of purity: this fact can be 
appreciated by inspecting Figure 3, and it provides a heuristic support of the possibility of 
reproducing the main features of the distribution of the chemical measures from the one of 
the spectrometric curves. 

3.2.2. District heating load-curves 

A district-heating system (or “teleheating”) allows the distribution of heat, generated in 
a centralized location, for entire districts through a network of insulated pipes. Due to its 
efficiency and to the pollution control, this system is spreading to many cities. In order to 
guarantee an optimal scheduling for generating heat, which allows choosing the right mix of 
on-line capacity, the analysis of the flows of heating demand is crucial. These flows depend 
mainly on two factors: an intra-daily pattern of the load demand, known as the load curve, 
and seasonal aspects. 

To manage data from a district-heating system, also in a forecasting perspective, it is 
useful to stratify the set of load curves into a few homogeneous groups exhibiting similar 
demand patterns, since consumers characteristics are very different according to seasons and 
weather conditions. In what follows, we propose an application of our clustering algorithm 
to data on heat consumption in Turin, a northern Italian city, where the district heating 
is produced through a co-generation system. The dataset has been used previously in a 
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Figure 4: Demand of heat in a selected week of November 2002, January and March 2003. 


forecasting context based on regression approaches (see Goia et ah, 2010, Goia, 2012). 

The dataset consists of hourly measurements of heat consumption for residential and 
commercial buildings during the periods October 15 - April 20, covering the years 2001- 
02, 2002-03, 2003-04 and 2004-05. Due to privacy requests from the data supplier, the 
data have been normalized. Figure 4 displays the behaviour of the heating demand in three 
selected weeks in autumn, winter and spring: it is possible to distinguish the intra-daily 
pattern, due to an inertia in the demand reflecting the aggregate behaviour of consumers, 
as well as the seasonal evolution. Differently from electricity power demand, intra weekly 
differences among working days and weekends do not appear. 

Taking advantage of the functional nature of the dataset, we split the series for each period 
into 187 functional observations, each one coincident with a specific daily load curve. Finally, 
we dispose of a functional dataset consisting of 748 curves discretized over an equispaced 
mesh of 24 points. Figure 5 displays the set of curves. 

The first step of our procedure is to perform a functional principal components analysis. 
It emerges that the spectrum is rather concentrated: the first three PCs explain more than 
the 97% of the total variance, and thus it is sufficient to limit our analysis to d < 3. In order 
to provide an interpretation of the contribution of the relevant PCs, we exploit a graphical 
tool where we report the estimated mean curve plus and minus a suitable multiple Mj of 

each estimated eigenfunction: p ± Mj ij\J~\j (see e.g. Ramsay and Silverman, 2005). The 
results, visualized in Figure 6, show that the first eigenfunction, which does not present 
sign changes, describes a vertical shift effect, due to weather conditions in seasons, whereas 
the second eigenfunction highlights differences among demand in the morning and in the 
remaining part of the day: it seems to be related to the heat retention ability of buildings 
(a greater heating in the morning produces less need in the afternoon). Finally, the third 
eigenfunction seems to be connected and counter-posed to the three peaks of demand that 
appear systematically during the day in the morning, in the afternoon and in the evening. 
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Figure 7: Calendar positioning of clusters against the daily mean load, corresponding modal curves and 
relationship among clusters and daily mean temperature. Along the panels, a cluster is identified by the 
same grey level. 


Iii order to apply the SmbP cluster algorithm, one has to preventively select the param¬ 
eters r and S. Using the same grid as in Section 3.2.1, the CH criterion suggests r = 3 and 
5 = 0.5, a choice that leads to k — 6 clusters. These clusters reflect the differences in level 
and behaviour of daily demand of heating in the different seasons: high levels of demand in 
winter with a strong peak in the morning, moderate level in autumn and spring with load 
curves presenting three peaks (in the morning, in the afternoon and in the evening). To 
better understand the effect of clustering, in Figure 7 we report the calendar positioning 
of each element of the clusters (each point represents a specific load curves, synthesized by 
its daily average) and alongside the modal curves, plotted using the same level of grey. We 
also report the box-plots resulting after a stratification of the daily mean temperature by 
the cluster labels: one can note that the temperature, without presenting a multi-modal 
density, is one of the most important external variables that can be used in such a clustering 
exercise. Matching the results, we recognize the typical patterns for winter and mid-seasons, 
distinguish freezing, cold and mild days. Finally, it emerges that, if one was to set up a fore¬ 
casting model, an accurate prediction of temperature would be the key to making accurate 
prediction in the demand of heating. 

3.2.3. Neuronal experiment 

The analysis of neuronal spiking activity, recorded under various behavioural conditions, 
is a central tool in neuroscience: data acquired from multiple neurons are essential to explain 
neural information processing. The problem is that contributions of multiple cells must be 
disentangled from the background noise and from each other in order to analyze the activity 
of individual neurons. The procedure that allows distinguishing the activity of one or more 
neurons from a noisy time series is known as spike sorting. 

In this section we show how the SmBP clustering can contribute to the spike sorting: 
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Figure 8: Neuronal Experiment (left to right): a random sample of 30 curves, 3-D Scatter plot of the first 
three Principal Components and the corresponding maximal level set when r = 7 and S = 0.8. 


each detected cluster can be thought to correspond to the activity of a single neuron. The 
dataset comes from a behavioural experiment performed at the Andrew Schwartz motorlab 
(http://motorlab.neurobio.pitt.edu/index.php) on a macaque monkey performing a 
center-out and out-center target reaching task with 26 targets in a virtual 3 D environment 
(see Todorova et ah, 2014 for a detailed description of the experiment considered). The 
neural activity recorded consists of all the action potentials detected above a channel-specific 
threshold on a 96-channel Utah array implanted in the primary motor cortex. The data set is 
split into 1000 functional data representing the voltage of neurons versus the time, discretized 
over a grid of 32 equispaced time points (normalized between 0 and 1). A sample, of 30 
selected randomly curves, is shown in the left panel of Figure 8. An analysis of (a larger set 
of) these curves can also be found in Todorova et al. (2014). 

Performing the functional PCA on such a set of curves, we observe that also in this 
case the spectrum is concentrated: the first PC explains 78.6% of the total variability, and 
the explained variance by the first three PCs is about 96.4%. Observing the 3-dimensional 
scatterplot of the first three PCs (see Figure 8, left panel), three clouds appear evidently. 

In order to detect a good choice for the parameters r and 8, we perform our clustering 
algorithm with d — 3 over a grid built from r = 2,..., 7 (with step equal to 1), and 
5 = 0.2,..., 1 (with step 0.1) and compute the CH indexes. The analysis of the obtained 
values leads to various admissible configurations for the couple (r,5), to which there always 
correspond k = 3 clusters: for instance r = 3,... ,7 combined with 5 = 0.8, 0.9,1 produce 
the same number of clusters and the maximal CH. The middle panel of Figure 8 shows the 
maximal level set when one uses r = 7 and 8 = 0.8. 

The result of the clustering procedure is visualized in Figure 9, where the centers of 
prototypes (the modal curves) and the corresponding clusters are reproduced. 
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Figure 9: Neuronal Experiment: clusters and modal curves when r = 7 and 6 = 0.8. 


3.3. Discriminant: simulation arid real data illustration 

The aim of this section is to assess the performance of the supervised classification algo¬ 
rithm illustrated in Section 2.2 (briefly, SrnBP classifier) by the analysis of both simulated 
and real datasets. The experiments consist in computing the (empirical) distribution out-of- 
sample misclassihcation error by a two-fold cross-validation procedure repeated 100 times: 
more in details, for each available sample, at each iteration, 2/3 of the data are used in 
evaluating the classifier, and the misclassihcation error is estimated on the remaining part. 
The estimation of the density in each group is performed by a multivariate kernel density 
estimator with H diagonal, selected according to Section 2.1.1. 

According to the comments in Section 2.1.3, classifiers are implemented with d = 2,... ,5, 
where d = 5 is considered only to see how the results worsen due to the curse of dimension¬ 
ality. The out-of-sample errors are compared with those computed by using parametric and 
nonparametric competitors: the GLM classifier based on the coefficients of a basis represen¬ 
tation for functional data, nonparametric discrimination using kernel (NP in the following) 
and k-NN estimation, based on the classic L 2 metric (see Ferraty and Vieu, 2006). All com¬ 
putations are done with the software R: in particular, the competitor algorithms are taken 
from the package fda. use (see Febrero-Bande and Oviedo de la Fuente, 2012). 

3.3.1. Analysis of simulated data 

The dataset are generated according to the two “horseshoes” model described in Sec¬ 
tion 3.1, with the following settings: 

- sample size n = 150, 300,450 with training-sets of size n in = 100, 200, 300; 

- we consider both the balanced case (it\ = 1/2), and two unbalanced cases (with 7Ti = 1/3 
and 7Ti = 1/4); 

- the vertical translation parameter k equals 0.5; 


22 




^in 

cr 

^out 

d = 2 

d = 3 

d = 4 

d = 5 

GLM 

k-NN 

NP 


0.05 


0.155 

0.026 

0.039 

0.058 

0.361 

0.024 

0.042 




(0.039) 

(0.015) 

(0.024) 

(0.035) 

(0.060) 

(0.020) 

(0.023) 

100 

0.10 

50 

0.228 

0.095 

0.111 

0.134 

0.345 

0.104 

0.130 




(0.052) 

(0.032) 

(0.040) 

(0.044) 

(0.054) 

(0.041) 

(0.041) 


0.15 


0.257 

0.188 

0.195 

0.221 

0.397 

0.159 

0.200 




(0.061) 

(0.045) 

(0.050) 

(0.051) 

(0.064) 

(0.049) 

(0.048) 


0.05 


0.185 

0.033 

0.042 

0.057 

0.307 

0.041 

0.047 




(0.042) 

(0.017) 

(0.017) 

(0.019) 

(0.031) 

(0.018) 

(0.022) 

200 

0.10 

100 

0.275 

0.150 

0.154 

0.167 

0.335 

0.136 

0.148 




(0.053) 

(0.049) 

(0.027) 

(0.028) 

(0.029) 

(0.038) 

(0.031) 


0.15 


0.255 

0.228 

0.234 

0.265 

0.440 

0.232 

0.250 




(0.039) 

(0.033) 

(0.031) 

(0.035) 

(0.047) 

(0.040) 

(0.036) 


0.05 


0.157 

0.020 

0.033 

0.036 

0.356 

0.030 

0.035 




(0.029) 

(0.007) 

(0.011) 

(0.010) 

(0.033) 

(0.010) 

(0.011) 

300 

0.10 

150 

0.217 

0.100 

0.117 

0.128 

0.383 

0.093 

0.108 




(0.033) 

(0.024) 

(0.023) 

(0.026) 

(0.036) 

(0.026) 

(0.026) 


0.15 


0.234 

0.171 

0.183 

0.192 

0.411 

0.162 

0.178 




(0.030) 

(0.024) 

(0.026) 

(0.023) 

(0.030) 

(0.026) 

(0.027) 


Table 2: Estimated mean and standard deviation (in parentheses) of misclassification error for the two- 
horseshoes setting. Two balanced groups: 7Ti = 1/2. 

- three degrees of variability are considered to model the noise around the two semi-circumferences: 
cr = 0.05,0.10,0.15, (small, medium and high variability). 

After performing the discrimination exercise, one obtains the misclassification error distri¬ 
butions whose summary measures (mean and standard deviation) are collected in tables 2, 

3 and 4. In Figure 10 such error distributions when rij n = 200, a = 0.05,0.10,0.15 are 
reproduced. 

From the tables and plots, it emerges that the SrnBP classifier produces the best results, 
both in the balanced and unbalanced cases when d — 3. This is coherent with the fact that 
FEV( 3) > 99% (see Section 3.1): for fixed n, increasing d further does not produce benefits; 
on the contrary, dimensionality causes a worsening in the classification abilities. Results 
with d = 3 are comparable with the ones of the k-NN and the nonparametric approach. As 
expected, due to the non-spherical nature of data, the GLM approach produces the worst 
results. 

3.3.2. Analysis of real datasets 

In what follows we analyze the performances of our SmBP classifier on three real well- 
known datasets belonging to three very different research domains: electrocardiography, 
growth curves and quality control. The same datasets have been used previously in Jacques 
and Preda (2014) in an unsupervised classification framework. 
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^in 

a 

flout 

d = 2 

d = 3 

d = 4 

d — 5 

GLM 

k-NN 

NP 


0.05 


0.156 

0.051 

0.067 

0.069 

0.342 

0.036 

0.060 




(0.045) 

(0.029) 

(0.038) 

(0.039) 

(0.067) 

(0.023) 

(0.039) 

100 

0.10 

50 

0.208 

0.154 

0.154 

0.172 

0.287 

0.101 

0.163 




(0.044) 

(0.048) 

(0.047) 

(0.045) 

(0.057) 

(0.043) 

(0.043) 


0.15 


0.241 

0.212 

0.225 

0.252 

0.366 

0.175 

0.221 




(0.050) 

(0.048) 

(0.052) 

(0.061) 

(0.058) 

(0.048) 

(0.050) 


0.05 


0.150 

0.035 

0.034 

0.045 

0.340 

0.030 

0.034 




(0.045) 

(0.013) 

(0.014) 

(0.015) 

(0.036) 

(0.014) 

(0.015) 

200 

0.10 

100 

0.211 

0.133 

0.153 

0.175 

0.342 

0.129 

0.146 




(0.038) 

(0.028) 

(0.029) 

(0.036) 

(0.038) 

(0.030) 

(0.029) 


0.15 


0.263 

0.180 

0.188 

0.201 

0.359 

0.169 

0.193 




(0.038) 

(0.035) 

(0.033) 

(0.032) 

(0.042) 

(0.036) 

(0.034) 


0.05 


0.137 

0.036 

0.039 

0.051 

0.327 

0.039 

0.043 




(0.025) 

(0.010) 

(0.012) 

(0.015) 

(0.042) 

(0.015) 

(0.013) 

300 

0.10 

150 

0.155 

0.083 

0.098 

0.100 

0.317 

0.080 

0.096 




(0.028) 

(0.020) 

(0.021) 

(0.022) 

(0.031) 

(0.021) 

(0.021) 


0.15 


0.200 

0.167 

0.170 

0.182 

0.341 

0.154 

0.185 




(0.031) 

(0.029) 

(0.028) 

(0.028) 

(0.038) 

(0.028) 

(0.031) 


Table 3: Estimated mean and standard deviation (in parentheses) of misclassification error for the two- 
horseshoes setting. Two unbalanced groups: 7Ti = 1/3. 


Simulated Curves - n = 200 - a = 0.05 Simulated Curves - n = 200 - a = 0.10 Simulated Curves - n = 200 - o = 0.15 



d = 2 d = 3 d = 4 d = 5 GLM Knn NP d = 2d = 3d = 4d = 5 GLM Knn NP d = 2d = 3d = 4d = 5 GLM Knn NP 


Figure 10: Distributions of misclassification errors estimated over 100 replications when rii n = 200 and 
a = 0.05,0.10,0.15 (form left to right). 
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Hin 

a 

^ out 

d = 2 

d = 3 

d = A 

d = 5 

GLM 

k-NN 

NP 


0.05 


0.112 

0.042 

0.046 

0.064 

0.279 

0.018 

0.040 




(0.048) 

(0.031) 

(0.031) 

(0.042) 

(0.070) 

(0.019) 

(0.035) 

100 

0.10 

50 

0.183 

0.151 

0.156 

0.164 

0.266 

0.082 

0.145 




(0.044 ) 

(0.047) 

(0.047) 

(0.046) 

(0.054) 

(0.041) 

(0.048) 


0.15 


0.169 

0.148 

0.142 

0.161 

0.260 

0.092 

0.147 




(0.042) 

(0.042) 

(0.047) 

(0.049) 

(0.056) 

(0.047) 

(0.040) 


0.05 


0.098 

0.024 

0.031 

0.033 

0.252 

0.020 

0.032 




(0.021) 

(0.013) 

(0.011) 

(0.013) 

(0.037) 

(0.015) 

(0.018) 

200 

0.10 

100 

0.152 

0.121 

0.128 

0.138 

0.258 

0.084 

0.120 




(0.040) 

(0.033) 

(0.033) 

(0.032) 

(0.035) 

(0.025) 

(0.032) 


0.15 


0.181 

0.152 

0.170 

0.179 

0.271 

0.129 

0.150 




(0.034) 

(0.031) 

(0.031) 

(0.031) 

(0.030) 

(0.030) 

(0.032) 


0.05 


0.119 

0.013 

0.021 

0.033 

0.271 

0.018 

0.020 




(0.020) 

(0.009) 

(0.013) 

(0.016) 

(0.033) 

(0.008) 

(0.008) 

300 

0.10 

150 

0.153 

0.127 

0.131 

0.128 

0.265 

0.107 

0.139 




(0.028) 

(0.024) 

(0.025) 

(0.025) 

(0.023) 

(0.024) 

(0.025) 


0.15 


0.174 

0.169 

0.181 

0.194 

0.256 

0.156 

0.181 




(0.027) 

(0.028) 

(0.025) 

(0.027) 

(0.028) 

(0.024) 

(0.025) 


Table 4: Estimated mean and standard deviation (in parentheses) of misclassification error for the two- 
horseshoes setting. Two unbalanced groups: 7Ti = 1/4. 
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The first dataset comes from the UCR Time Series Classification and Clustering website 
(http://www.cs.ucr.edu/~eamonn/time_series_data/). It consists of 200 electrocardio¬ 
graphy (ECG) curves observed at 96 discretization points and related to 2 groups of patients 
(see Olszewski, 2001 for more details). 

The second dataset is the well-known Berkeley growth dataset (see Tuddenham and 
Snyder, 1954). It contains stature measurements for 54 girls and 39 boys, aged from 1 to 
18 years, and observed in 31 (not equispaced) discretization points. To obtain the growth 
curves, the original raw data are preprocessed by fitting each individual set of discretized 
data with a monotone smoothing method (see Ramsay and Silverman, 2005). The aim is to 
discriminate the curves on the basis of gender. 

The third dataset, described in detail in Leveder et al. (2004), comes from Danone Vi- 
tapole Paris Research Center. The aim is to detect the quality of produced cookies in 
relationship with the flour kneading process. Each curve in the dataset collects the measure¬ 
ments of dough resistance during the kneading process at 241 equispaced instants of time 
in the interval [0,480] seconds. Overall, 115 flours are analyzed: 50 of them have produced 
cookies of good quality, 25 of medium quality and 40 of low quality. The goal of the analysis 
is to classify the functional dataset based on the quality of cookies. 

The three datasets of curves are plotted at the top of Figure 11: in the plots, the group 
membership of each individual (patient, child or flour) is highlighted by using different 
colours. The estimated conditional pseudo-densities fd(-\g), with d = 3, are depicted in 
Figure 11. Finally, Table 6 reports the estimated FEV in the three cases. 

We apply our SmBP classification method with balanced groups with the dimension d 
which varies from 2 to 5. The distributions of misclassihcation errors for our approach and 
the competitors are reproduced at the bottom of Figure 11. Moreover, to allow a direct 
comparison, Table 5 collects the estimated mean and standard deviation of misclassihcation 
error distributions for the three real datasets: for the SmBP approach, we report the best 
results obtained and the corresponding dimension d. Unbalanced group structure (requiring 
the estimate of prior probabilities iT g ) does not change the obtained results. 

The results reveal how the SmBP classifier behaves with d: if, in principle, misclassihcation 
errors should reduce with d coherently with FEV and the mixture structure, in practice, 
larger values may amplify the noise due to bad estimation and, in the proposed examples, 
we find that a good compromise between approximation and dimensionality is reached with 
d = 3 (for growth curves and kneading process dataset) and d = 4 (for ECG dataset). 

It is worth noting that our method performs rather well when compared to the other ones: 
despite the fact that the k-NN approach tends to produce good results uniformly in all cases, 
our method is always comparable, with closed results. More in detail, in the growth curves 
case with d = 3, SmBP classifier is equivalent to k-NN and better than the nonparametric 
approach: the quality of the results is explainable by observing the estimated conditional 
pseudo-densities, where a thin overlapping “grey zone” emerges. For the ECG dataset, 
the best result (for d = 4) is comparable with that obtained by nonparametric classifier. 
For what concerns the kneading process dataset, all of the proposed methods suffer from a 
relatively wide overlapping region of the estimated conditional pseudo-densities. 
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ECG Dataset 


Growth curves 


Kneading Dataset 



Figure 11: Top to bottom: curves of the considered dataset, conditional densities of the first three PCs 
scores and Out-of-sample Misclassification errors over 100 replications. 

Case studies (left to right): ECG (2 groups), growth curves (2 groups) and kneading process (3 groups). 
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EGG 

Growth Curves 

Kneading Process 

SmBP 

d = 4 0.148 (0.035) 

d = 3 0.028 (0.024) 

d = 3 0.273 (0.061) 

GLM 

0.213 (0.037) 

0.075 (0.049) 

0.286 (0.070) 

k-NN 

0.090 (0.038) 

0.030 (0.027) 

0.221 (0.068) 

NP 

0.143 (0.036) 

0.041 (0.030) 

0.251 (0.064) 


Table 5: Estimated mean and standard deviation (in parentheses) of misclassification error distribution for 
the three real datasets. For the SmBP approach the dimension d at which we obtain the best results is 
reported. 


d 

1 

2 

3 

4 

5 

EGG 

38.6 

63.4 

73.3 

79.2 

83.0 

Growth Curves 

81.7 

95.9 

98.7 

99.5 

99.8 

Kneading Process 

89.6 

93.2 

95.6 

96.3 

96.8 


Table 6: Estimated FEV for the considered datasets. 


4. Conclusions 

In this paper, an unsupervised and a supervised classification method based on the con¬ 
cept of SmBP mixture for Hilbert-valued process have been introduced and analyzed. The 
novelty lies in the use of the theoretical factorization of the SmBP due to Bongiorno and 
Goia (2015) and reported in Proposition 1. Such a result introduces a surrogate-density 
for Hilbert-valued processes that, on the one hand, endorses a “density oriented” clustering 
approach for detecting the latent structure by incorporating the information on the mixture, 
and, on the other one, leads to define an optimal Bayes classifier in a supervised classifi¬ 
cation (discriminant) context. From a theoretical point of view, the approaches proposed 
here can be seen as semi-parametric: the coefficients of the Karhunen-Loeve decomposi¬ 
tion, truncated at a suitable order d, define the pseudo-density of a mixture model which 
is not entirely specified and is estimated nonparametrically. In this view, the detection of 
a reasonable dimension d which balances the trade-off between a good approximation and 
the curse of dimensionality is an important task. Furthermore, dealing with joint and con¬ 
ditional pseudo-densities leads naturally to represent them graphically, for d < 3, in order 
to understand the underlining mixture “structure” and to evaluate why classification errors 
may arise. In addition to the theoretical aspects, computational issues deserve attention as 
well: especially for clustering, the problem of tuning parameters is deeply considered, and 
known tools are implemented. The use of such standard tools has shown some shortcomings 
strictly related to the open long-standing problem of finding a universal optimal criteria 
for validating clustering procedures. The issue of efficient tuning methods requires further 
study, which is beyond the scope of this work. 








Appendix A. Sketch of the Proofs 

This Section contains a sketch of Propositions 1 and 4; for a detailed and theoretical 
discussion the interested reader can refer to Bongiorno and Goia (2015). 

Appendix A. 1. Sketch of proof of Proposition 1 

At the beginning, fix d € N and consider the quantities 

Si = Si(d,x) = ^ (9j - {x,Cj)f and S = S(d, e, x) = (0j - (x, ^)) 2 . 

j<d j>d +1 


These allow to rewrite the SmBP as follows 

<p(e, x) = P (||A - x|| 2 < e 2 ) = P (Si + e 2 S < e 2 ) = I P (Si < (1 - s)£ 2 ) dG (s), 

J o 

where G is the cdf of S. In terms of f d (•), the probability density function of 6 = (6k,..., 9 d )' , 
it holds 

P (Si < (1 - S)£ 2 ) = [ / rf (d)dd, 

with D — I'd e : Y2j<d — x j ) 2 < £ 2 (1 — s) j. The Taylor expansion of f d about the 
point (xi,... ,Xd) leads to the following first order approximation 


¥>(e, x ) ~ f d (x)V d (e) E (1 - S) d/2 1{s<i} 


£ —y 0 . 


According to the type of eigenvalue decay rate, it is possible to choose d = d(e) as a function 
of e so that it approaches infinity when £ goes to zero and such that 


S^O, 


and 


E 


(1 — S) d/2 1 { s<i } 


->• 1. 


Finally, errors due to both the latter approximation and the Taylor expansion can be simul¬ 
taneously controlled by exploiting the kind of eigenvalues decay rate; in particular, it turns 
out that the faster they decay, the smaller is the total error. 


Appendix A.2. Sketch of proof of Proposition f 

Consider f djTl , the pseudo-estimator for f d , given by 

1 n 

fd,n (n d x) = - V K Hn (||n d (Xi - x)||), u d x e M d , 

n z —' 

i= 1 


that involves the true but unknown projector operator hC (•) = X^/=i By the 

triangle inequality, 


E 


1 2 


fd ( x ) ~ fd,n ( x ) < E [f d (x) ~ f d}n (x)] + E f d ^ n (x) - f d<n (x) 


(A.l) 
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Regarding the first term on the right-hand side of (A.l), it is well known in the literature 
(see for instance Wand and Jones, 1995) that, under Assumptions (B.1)-(B.4) and taking 
the optimal bandwidth (6), one gets the minimax rate: 

E [f d (x)-f d , n (x)] 2 = 0(n~ 2 ^ 2p+ ^) 

uniformly in Regarding the second addend on the right-hand side of (A.l), since H n = 
h 2 /, it holds 


(nh d ) 2 E f d)n ( X ) - fd,n ( X ) 


1 2 


= E 




i— 1 


n 2 


where V) = ||II d (X t — x)||, V) = fl rf (X t — x) . Consider the events A* = {V, < h n }, = 
Vi < h n i; we get that 


(nh d ) 2 E f d)7l (x) - f diU (x) 


<2E 


^Tl/dhl-Whi ii 


2=1 


n 2 


hn J \ h n 




4E 


V i =1 


E A '(vK 


h n J AinBi I \ ^ \ h n I A i nB i 

1=1 


Under Assumptions (B.1)-(B.4) and after some computations, it is possible to show that, 
for any d > 1 and as n —V oo 


(nh%Y 


rE 




2=1 


< C 


h 


2(d—1) 


n 


whereas 


{nliff 
and, hence, 


E 


5>(hu 


, 2=1 


hn / ~ A ^ 




v i=l 


hn / A i nB i 


< c 


,ihl f2d+ V 


1/3 


E 


fd,n (x) ~ fd,n (x) 


= O 


h 


2(d—1)' 


n 


+ 0 


nhl (2d+1) ) 


l/3> 


(A.2) 


For any d > 1, a direct computation shows that, taking the optimal bandwith (6) and 
p > (3d + 2) /2, the bounds in equation (A.2) are definitively negligible compared to the 
“optimal bound” n~ 2p/ A p+d h 
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